The goal of this workshop is to help you become comfortable getting your data into R and using it for basic data visualisations, summaries, etc. Going forward, you will expand on these skills, learn some more complex techniques, and produce a statistical workflow for your data. But for the moment, you just need to feel OK about setting up a project, getting your data into R, and looking at it. This is the first and perhaps most important step of any statistical analysis. Further, having your work in R helps you keep a record of what you have done and helps us help you at the data analysis workshops and drop-in sessions.
A strength of R compared with many general-purpose languages is that it was built for handling and analysing data. We will start by looking at some data from the following paper:
Cuzick, J., Warwick, J., Pinney, E., Duffy, S. W., Cawthorn, S., Howell, A., … & Warren, R. M. (2011). Tamoxifen-induced reduction in mammographic density and breast cancer risk reduction: a nested case–control study. Journal of the National Cancer Institute, 103(9), 744-752.
First, create an RStudio project for this workshop – you can do this (and switch between projects) using the icon in the top right of your RStudio window. Make sure you choose an informative name and location (and make sure you know where you have put it).
If you open that location using Finder / Windows Explorer you’ll see that the RStudio project is just a folder with a .Rproj file inside – you can create subfolders (like ‘data’ shown here), copy and paste files, etc. as you normally would.
I will provide you with a data file over Dropbox or similar. Download it, create a ‘data’ folder in your project, and put the file there.
The data file is a ‘.csv’ file, which stands for ‘comma-separated values’ – it is just a plain-text spreadsheet. Before looking at it in R, we can look at it in a text editor or in Excel.
It is a single spreadsheet with no formatting – each line is a row, and columns are separated by commas. More specifically, each row represents a patient, and the columns are the relevant measurements/observations/variables. Note that the data starts in the top left, and has a single ‘header’ row with the names of the columns. This is the ideal way to set up your data for analysis. We will talk a little more about spreadsheets tomorrow.
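To make the layout concrete, here is a minimal sketch that writes a tiny CSV file to a temporary location and reads it back. The two data rows are copied from the first rows of the workshop data; the temporary file and the names csv_lines, tmp_file and tiny_df are only for illustration, and base R's read.csv() is used so that the sketch runs without any extra packages.

```r
# One header row with the column names, then one row per patient
csv_lines <- c(
  "case,ARM,AGE,BMI,density",
  "1,1,38,21.8,40",
  "0,1,43,32.3,5"
)

# Write the lines to a temporary file (illustration only, not the workshop file)
tmp_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_file)

# The raw text, exactly as a text editor would show it
readLines(tmp_file)

# The same file parsed into rows and columns
tiny_df <- read.csv(tmp_file)
tiny_df
```

Because the data start in the top left and there is exactly one header row, R can work out the column names and values without any extra instructions.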
We will jump straight into looking at the data in R. First, you need to load the ‘tidyverse’ package (you may need to install it first if you haven’t already).
# install.packages("tidyverse")  # run once if the package is not installed yet
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
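Installing a package only needs to be done once per computer, whereas library() has to be run in every new R session. As a minimal sketch (the variable name has_tidyverse is just for illustration), you can check whether the package is already installed before deciding whether to install it:

```r
# requireNamespace() returns TRUE if the package is installed, FALSE otherwise;
# it does not attach the package the way library() does
has_tidyverse <- requireNamespace("tidyverse", quietly = TRUE)

if (!has_tidyverse) {
  # Nothing is installed automatically here; this only prints a reminder
  message('tidyverse is not installed: run install.packages("tidyverse")')
}
```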
Then, we read in the data using the read_csv() function and assign the result to a variable named cancer_df. Note:
- When you type the file name inside quotes (''), you can press Tab to autocomplete it and avoid spelling mistakes.
- The file name should appear green in RStudio; the rest of the R code should stay black.
- ‘<-’ means ‘is’, so you might read this line of code as ‘cancer_df is the output of read_csv() with the file “data/….csv”’.
cancer_df <- read_csv('data/Cuzick_2010_breast_cancer_density.csv')
## Rows: 1065 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): case, ARM, AGE, BMI, density
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
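If read_csv() complains that the file does not exist, the usual cause is that R is looking in a different folder from the one you expect. A small sketch using base R functions, assuming the file sits in a ‘data’ folder inside your project as described above (data_path and file_found are illustrative names):

```r
# Path relative to the project folder (same file name as used above)
data_path <- file.path("data", "Cuzick_2010_breast_cancer_density.csv")

# The folder R is currently working in; inside an RStudio project
# this should be the project folder
getwd()

# TRUE if R can see the file from the current folder, FALSE otherwise
file_found <- file.exists(data_path)
file_found

# The files inside the 'data' folder (empty if the folder does not exist)
list.files("data")
```

If file_found is FALSE, check that you opened the right project and that the file name is spelled exactly as it appears in the folder.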
The first thing I would do after reading in a file is look at it.
cancer_df
## # A tibble: 1,065 × 5
## case ARM AGE BMI density
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 38 21.8 40
## 2 0 1 43 32.3 5
## 3 0 1 46 23 45
## 4 0 2 52 19.6 40
## 5 0 1 59 26.2 40
## 6 0 1 62 23.7 80
## 7 0 2 35 27.9 25
## 8 0 1 58 25.8 15
## 9 0 1 51 27.7 10
## 10 0 2 40 38.4 20
## # … with 1,055 more rows
A more informative/readable view comes from the function str(), which is short for ‘structure’. Read this as “str of cancer_df”.
str(cancer_df)
## spec_tbl_df [1,065 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ case : num [1:1065] 1 0 0 0 0 0 0 0 0 0 ...
## $ ARM : num [1:1065] 1 1 1 2 1 1 2 1 1 2 ...
## $ AGE : num [1:1065] 38 43 46 52 59 62 35 58 51 40 ...
## $ BMI : num [1:1065] 21.8 32.3 23 19.6 26.2 23.7 27.9 25.8 27.7 38.4 ...
## $ density: num [1:1065] 40 5 45 40 40 80 25 15 10 20 ...
## - attr(*, "spec")=
## .. cols(
## .. case = col_double(),
## .. ARM = col_double(),
## .. AGE = col_double(),
## .. BMI = col_double(),
## .. density = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
It is important to check that the data are what you expect, e.g., that columns you expect to be numbers are represented that way. Read through each line of the str() output (i.e., each column of the data) and think about what it is, what the values are, etc.
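The same checks can also be done one at a time. The sketch below rebuilds the first ten rows printed above as a small data frame (toy_df is an illustrative name, not part of the workshop data) and uses base R functions to look at its size, the type of each column, and the distinct values of a coded column.

```r
# The first ten rows of the data, typed in by hand for illustration
toy_df <- data.frame(
  case    = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  ARM     = c(1, 1, 1, 2, 1, 1, 2, 1, 1, 2),
  AGE     = c(38, 43, 46, 52, 59, 62, 35, 58, 51, 40),
  BMI     = c(21.8, 32.3, 23, 19.6, 26.2, 23.7, 27.9, 25.8, 27.7, 38.4),
  density = c(40, 5, 45, 40, 40, 80, 25, 15, 10, 20)
)

# Number of rows and columns
dim(toy_df)

# The type of each column; all five should be "numeric"
col_classes <- sapply(toy_df, class)
col_classes

# How often each value of ARM occurs; only 1 and 2 should appear
arm_counts <- table(toy_df$ARM)
arm_counts
```

A column that shows up as "character" when you expected numbers usually means that some cell contains text, such as a stray unit or a typo.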
Another useful thing to look at is the summary of a data frame, or the head (i.e., the first few rows).
summary(cancer_df)
## case ARM AGE BMI
## Min. :0.0000 Min. :1.000 Min. :35.00 Min. :17.60
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:46.00 1st Qu.:23.20
## Median :0.0000 Median :1.000 Median :49.00 Median :25.70
## Mean :0.1155 Mean :1.476 Mean :50.17 Mean :26.72
## 3rd Qu.:0.0000 3rd Qu.:2.000 3rd Qu.:54.00 3rd Qu.:29.40
## Max. :1.0000 Max. :2.000 Max. :70.00 Max. :50.40
## NA's :16
## density
## Min. : 0.00
## 1st Qu.: 15.00
## Median : 40.00
## Mean : 44.45
## 3rd Qu.: 70.00
## Max. :100.00
##
head(cancer_df)
## # A tibble: 6 × 5
## case ARM AGE BMI density
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 38 21.8 40
## 2 0 1 43 32.3 5
## 3 0 1 46 23 45
## 4 0 2 52 19.6 40
## 5 0 1 59 26.2 40
## 6 0 1 62 23.7 80
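The summary above reports 16 NA values in the BMI column. NA is how R marks a missing value, and many functions return NA unless told to ignore missing values. A minimal sketch with a made-up vector (the values and the names bmi, n_missing and mean_bmi are hypothetical, chosen only to show the behaviour):

```r
# Hypothetical BMI values; the NA marks a patient whose BMI was not recorded
bmi <- c(21.8, 32.3, NA, 19.6)

# Count the missing values
n_missing <- sum(is.na(bmi))
n_missing

# With a missing value present, the mean is itself NA
mean(bmi)

# na.rm = TRUE drops the missing values before calculating
mean_bmi <- mean(bmi, na.rm = TRUE)
mean_bmi
```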
The gtsummary package offers formatted tables that give a quick summary of the data. The default summary statistics are median (IQR) for numerical data and n (%) for categorical data. You can change these defaults as you wish. Here are a few examples:
library(gtsummary)
cancer_df %>%
tbl_summary()
| Characteristic | N = 1,065¹ |
|---|---|
| case | 123 (12%) |
| ARM | |
| 1 | 558 (52%) |
| 2 | 507 (48%) |
| AGE | 49 (46, 54) |
| BMI | 25.7 (23.2, 29.4) |
| Unknown | 16 |
| density | 40 (15, 70) |

¹ n (%); Median (IQR)
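As one way of changing the defaults, the sketch below asks for mean (SD) instead of median (IQR) through the statistic argument of tbl_summary(). It assumes gtsummary is installed and uses a ten-row toy data frame copied from the rows printed earlier (toy_df and custom_tbl are illustrative names). With so few rows, tbl_summary() may treat a numeric column with few distinct values as categorical, so the continuous columns are declared explicitly with the type argument.

```r
library(gtsummary)

# The first ten rows of the data, typed in by hand for illustration
toy_df <- data.frame(
  case    = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  ARM     = c(1, 1, 1, 2, 1, 1, 2, 1, 1, 2),
  AGE     = c(38, 43, 46, 52, 59, 62, 35, 58, 51, 40),
  BMI     = c(21.8, 32.3, 23, 19.6, 26.2, 23.7, 27.9, 25.8, 27.7, 38.4),
  density = c(40, 5, 45, 40, 40, 80, 25, 15, 10, 20)
)

custom_tbl <- tbl_summary(
  toy_df,
  # Say explicitly which columns are continuous measurements
  type = list(c(AGE, BMI, density) ~ "continuous"),
  statistic = list(
    # Mean (SD) instead of the default median (IQR)
    all_continuous() ~ "{mean} ({sd})",
    # Count out of the total, with the percentage
    all_categorical() ~ "{n} / {N} ({p}%)"
  )
)
custom_tbl
```

Which statistics are appropriate depends on your data; mean (SD) is shown here only to illustrate the mechanism.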
In this example, there are two treatment groups. We can summarise characteristics by treatment group.
cancer_df %>%
tbl_summary(by = ARM) %>%
add_overall(last = TRUE)
| Characteristic | 1, N = 558¹ | 2, N = 507¹ | Overall, N = 1,065¹ |
|---|---|---|---|
| case | 72 (13%) | 51 (10%) | 123 (12%) |
| AGE | 49 (46, 54) | 49 (46, 54) | 49 (46, 54) |
| BMI | 25.8 (23.4, 29.1) | 25.5 (23.0, 29.7) | 25.7 (23.2, 29.4) |
| Unknown | 6 | 10 | 16 |
| density | 45 (15, 70) | 40 (20, 70) | 40 (15, 70) |

¹ n (%); Median (IQR)
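To see what a by-group summary is doing, the same kind of numbers can be calculated by hand. This is a rough sketch using base R's tapply() on a ten-row toy data frame copied from the rows printed earlier, not a description of how gtsummary works internally; toy_df, median_age and cases_by_arm are illustrative names.

```r
# The first ten rows of the data, typed in by hand for illustration
toy_df <- data.frame(
  case    = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  ARM     = c(1, 1, 1, 2, 1, 1, 2, 1, 1, 2),
  AGE     = c(38, 43, 46, 52, 59, 62, 35, 58, 51, 40),
  BMI     = c(21.8, 32.3, 23, 19.6, 26.2, 23.7, 27.9, 25.8, 27.7, 38.4),
  density = c(40, 5, 45, 40, 40, 80, 25, 15, 10, 20)
)

# Median age within each treatment arm
median_age <- tapply(toy_df$AGE, toy_df$ARM, median)
median_age

# Number of cases within each arm (case is coded 0/1, so the sum is a count)
cases_by_arm <- tapply(toy_df$case, toy_df$ARM, sum)
cases_by_arm
```

The results are labelled by the values of ARM, so each number can be matched to its treatment group.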
You aren’t alone in your R journey. You should always expect to make use of the experience of those around you.
The most useful thing you can do to make your life easier is to practice. If you don’t use R for a few months, you will likely forget and have to refresh, whereas the more often you practice, the less likely you are to forget and the easier life will be.